Abstract

In this paper we present a probabilistic model for constraint-based grammars and a method 
for estimating the parameters of such models from incomplete, i.e., unparsed data. Whereas meth- 
ods exist to estimate the parameters of probabilistic context-free grammars from incomplete data 
(Q), so far for probabilistic grammars involving context-dependencies only parameter estimation 
techniques from complete, i.e., fully parsed data have been presented. However, complete-data
estimation requires labor-intensive, error-prone, and grammar-specific hand-annotating of large 
language corpora. We present a log-linear probability model for constraint logic programming, and 
a general algorithm to estimate the parameters of such models from incomplete data by extending 
the estimation algorithm of Q to incomplete data settings. 

Abstract 

This paper presents a probabilistic model for context-sensitive, constraint-based
grammars and, for the first time, a method for estimating the parameters of such probabilistic
models from incomplete data. Probabilistic grammars are formalized here in a log-linear
probability model for constraint logic programming. The parameter estimation method
presented is an extension of the Improved Iterative Scaling algorithm
of pj to parameter estimation from incomplete data. The methods presented enable
the probabilistic modeling of a wide variety of constraint-based grammars and the
automatic training of such models from unparsed language data.



1 Introduction 

Probabilistic grammars are of great interest for computational natural language processing 
(NLP), e.g., because they allow the resolution of structural ambiguities by a probabilistic 
ranking of competing analyses. A prerequisite for such applications is parameter estimation,
i.e., a method to adapt the model parameters to best account for a given language
corpus. Clearly, an estimation technique similar to the well-known maximization technique
of Baum for context-free models would also be desirable for constraint-based models.
Baum's maximization technique permits model parameters to be efficiently estimated 
from incomplete, i.e., unparsed data rather than from complete, i.e., fully parsed data. 
Recently, an attempt to apply this estimation technique to a probabilistic version of the 
constraint logic programming (CLP) scheme of [12] has been presented by [pLT| |. As
recognized by Eisele, there is a context-dependence problem associated with applying this
technique to constraint-based systems. That is, incompatible variable bindings can lead 
to failure derivations, which cause a loss of probability mass in the estimated probability 
distribution over derivations. This probability leakage prevents the estimation procedure 
from yielding the desired maximum likelihood values in the general case. A similar prob- 
lem troubles every attempt to embed Baum's maximization technique into an estimation 
procedure for probabilistic analogues of constraint-based processing systems (see, e.g.,
||, 0, 0, fll6f , or [TriPPl From a mathematical point of view, all such constraint-based 
approaches contradict the inherent assumptions of Baum's maximization technique which
require that the derivation steps are mutually independent and that the set of licensed 
derivations is unconstrained. Only recently, [1] has shown how to overcome this prob- 
lem by using the algorithm of |§ for estimation. This method, however, applies only to 
complete data. 

Unfortunately, the need to rely on large samples of complete data is impractical. For 
parsing applications, complete data means several person-years of hand-annotating large 
corpora with specialized grammatical analyses. This task is always labor-intensive, error- 
prone, and restricted to a specific grammar framework, a specific language, and a specific 
language domain. Clearly, flexible techniques for parameter estimation of probabilistic 
constraint-based grammars from incomplete data are desirable. 

The aim of this paper is to solve the problem of parameter estimation from incomplete 
data for probabilistic constraint-based grammars. For this aim, we present a log-linear 
probability model for CLP. CLP is used here to provide an operational treatment of 
purely declarative grammar frameworks such as PATR, LFG or HPSG.² A probabilistic
CLP scheme then yields a formal basis for probabilistic versions of various constraint- 
based grammar formalisms. The probabilistic model defines a probability distribution 
over the proof trees of a constraint logic program on the basis of weights assigned to 
arbitrary properties of the trees. In NLP applications, such properties could be, e.g., 
simply context-free rules or context-sensitive properties such as subtrees of proof trees 
or non-local head-head relations. The algorithm we will present is an extension of the 
estimation method for log-linear models of || to incomplete-data settings. Furthermore, 
we will present a method for automatic property selection from incomplete data. 

The rest of this paper is organized as follows. Section 2 introduces the basic formal
concepts of CLP. Section 3 presents a log-linear model for probabilistic CLP. Parameter
estimation and property selection of log-linear models from incomplete data are treated in
Sect. 4. Concluding remarks are made in Sect. 5.



2 Constraint Logic Programming for NLP 



In the following we briefly review the basic concepts of the CLP scheme of [12]. A
constraint-based grammar is encoded by a constraint logic program $\mathcal{P}$ with constraints
from a grammar constraint language $\mathcal{L}$ embedded into a relational programming constraint
language $\mathcal{R}(\mathcal{L})$.

Let us consider a simple non-linguistic example. The program of Fig. 1 consists of five
definite clauses with embedded $\mathcal{L}$-constraints from a language of hierarchical types. The
ordering on the types is defined by the operation of set inclusion on the denotations $(\cdot')$
of the types, with $a' \subseteq c' \subseteq e'$, $b' \subseteq d' \subseteq e'$, and $c' \cap d' = \emptyset$.

1 Moreover, even approaches such as that of where the derivation steps are in fact context-free, must be
characterized as constraint-based and exhibit a similar problem because discarding derivations incompatible with
the bracketing of a training corpus from the estimation procedure also induces a problem of loss of probability
mass.

2 An example for an embedding of feature-based constraint languages into the CLP scheme of [12] is the formalism
CUF (juj).

s(Z) <- p(Z) & q(Z).
p(Z) <- Z = a.
p(Z) <- Z = b.
q(Z) <- Z = a.
q(Z) <- Z = b.



Figure 1: Simple constraint logic program 

Seen from a parsing perspective, an input string corresponds to an initial goal or
query $G$ which is a possibly empty conjunction of $\mathcal{L}$-constraints and $\mathcal{R}(\mathcal{L})$-atoms. Parses
of a string (encoded by $G$) as produced by a grammar (encoded by $\mathcal{P}$) correspond to
$\mathcal{P}$-answers of $G$. A $\mathcal{P}$-answer of a goal $G$ is defined as a satisfiable $\mathcal{L}$-constraint $\varphi$ s.t.
the implication $\varphi \rightarrow G$ is a logical consequence of $\mathcal{P}$. The operational semantics of
conventional logic programming, SLD-resolution ([14]), is generalized by performing goal
reduction only on the $\mathcal{R}(\mathcal{L})$-atoms and solving conjunctions of collected $\mathcal{L}$-constraints
by a given $\mathcal{L}$-constraint solver. An example of queries and proof trees for the program
of Fig. 1 is given in Fig. 2.
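The derivation procedure can be illustrated on the five-clause example program with a small Python sketch. It is not the CLP scheme itself: the constraint solver is replaced by set intersection over hypothetical finite denotations, chosen only so that the stated inclusions between the types hold, and the names `DENOT`, `CLAUSES`, `satisfiable`, and `proof_trees` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical denotations satisfying a' <= c' <= e', b' <= d' <= e', c' & d' = {}.
DENOT = {
    "a": {1},
    "b": {3},
    "c": {1, 2},
    "d": {3, 4},
    "e": {1, 2, 3, 4},
}

# Body constraints of the unit clauses for p and q in the example program.
CLAUSES = {"p": ["a", "b"], "q": ["a", "b"]}


def satisfiable(types):
    """Z = t1 & ... & Z = tk is satisfiable iff the denotations intersect."""
    common = set.intersection(*(DENOT[t] for t in types))
    return bool(common)


def proof_trees(query_type):
    """Enumerate successful derivations of the query s(Z) & Z = query_type.

    Each proof tree is identified by the clause choices made for p and q.
    """
    trees = []
    for p_choice, q_choice in product(CLAUSES["p"], CLAUSES["q"]):
        if satisfiable([query_type, p_choice, q_choice]):
            trees.append((p_choice, q_choice))
    return trees


for t in "abcde":
    print(t, proof_trees(t))
```

Under these assumed denotations, only the query with type e has more than one proof tree. A real solver would also prune a failing branch as soon as the first incompatible binding is collected, rather than testing the full conjunction at the end.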

In the following it will be convenient to view the search space determined by this
derivation procedure as a search of a tree. Each derivation from a query $G$ and a program
$\mathcal{P}$ corresponds to a branch of a derivation tree, and each successful derivation to a subtree
of a derivation tree, called a proof tree, with $G$ as root node and a $\mathcal{P}$-answer as terminal
node. We assume each parse of a sentence to be associated with a single proof tree. In
order to rank parses in terms of their likelihood, we define a probability distribution over 
proof trees. To this end we propose a log-linear model. 



3 A Log-Linear Probability Model for CLP 

Log-linear models are powerful exponential probability distributions which define the 
probability of an event as being proportional to the product of weights assigned to selected 
properties of the event.³ For our application, the special instance of interest is a log-linear
distribution over the countably infinite set of proof trees for a set of queries to a program. 
Log-linear distributions take the following form. 

Definition 1. A log-linear probability distribution $p_\lambda$ on a set $\Omega$ is defined s.t. for all
$\omega \in \Omega$:

$p_\lambda(\omega) = Z_\lambda^{-1} e^{\lambda \cdot \nu(\omega)} p_0(\omega)$,

$Z_\lambda = \sum_{\omega \in \Omega} e^{\lambda \cdot \nu(\omega)} p_0(\omega)$ is a normalizing constant,

$\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$ is a vector of log-parameters,

$\nu = (\nu_1, \ldots, \nu_n)$ is a vector of property-functions s.t. for each $\nu_i : \Omega \rightarrow \mathbb{N}$, $\nu_i(\omega)$ is the
number of occurrences of the i-th property in $\omega$,

$\lambda \cdot \nu(\omega)$ is a weighted property-function s.t. $\lambda \cdot \nu(\omega) = \sum_{i=1}^{n} \lambda_i \nu_i(\omega)$,

$p_0$ is a fixed initial distribution.

3 Log-linear models emerged in statistical physics as Gibbs or Boltzmann distributions and can also be interpreted
as maximum entropy distributions UM.
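Definition 1 can be made concrete with a short Python sketch. It assumes a finite event set, whereas the definition allows a countably infinite one, and the function name `log_linear` and the toy values are hypothetical.

```python
import math


def log_linear(p0, nu, lam):
    """Return p_lambda as a dict over a finite event set.

    p0:  dict event -> initial probability p_0(event)
    nu:  dict event -> tuple of property counts (nu_1, ..., nu_n)
    lam: sequence of log-parameters (lambda_1, ..., lambda_n)
    """
    weights = {
        w: math.exp(sum(l * v for l, v in zip(lam, nu[w]))) * p0[w]
        for w in p0
    }
    z = sum(weights.values())  # normalizing constant Z_lambda
    return {w: weight / z for w, weight in weights.items()}


# Toy event set with two properties (illustrative values only).
p0 = {"x1": 0.5, "x2": 0.5}
nu = {"x1": (1, 0), "x2": (0, 1)}
p = log_linear(p0, nu, lam=(math.log(3.0), 0.0))
print(p)
```

With all log-parameters at zero the model reduces to the initial distribution $p_0$. Raising a parameter shifts probability mass toward events that contain the corresponding property.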

When we search for a proper probability distribution over given training data in a 
maximum likelihood estimation framework, we want to find a distribution reflecting the 
statistics of the training corpus. This means, we have to choose useful properties (prop- 
erty selection) and appropriate corresponding log-parameters (parameter estimation). A 
definition of properties convenient for our application is as subtrees of proof trees. 

Tokens x query   Root node        Second node             Third node(s)
3 x y1           s(Z) & Z = a     p(Z) & q(Z) & Z = a     q(Z) & Z = a
1 x y2           s(Z) & Z = b     p(Z) & q(Z) & Z = b     q(Z) & Z = b
4 x y3           s(Z) & Z = c     p(Z) & q(Z) & Z = c     q(Z) & Z = a
1 x y4           s(Z) & Z = d     p(Z) & q(Z) & Z = d     q(Z) & Z = b
1 x y5           s(Z) & Z = e     p(Z) & q(Z) & Z = e     q(Z) & Z = a  or  q(Z) & Z = b



Figure 2: Queries and proof trees for constraint logic program 

Suppose we have a training corpus of ten queries, consisting of three tokens of query
$y_1$: s(Z) & Z = a, four tokens of $y_3$: s(Z) & Z = c, and one token each of query $y_2$:
s(Z) & Z = b, $y_4$: s(Z) & Z = d, and $y_5$: s(Z) & Z = e. The corresponding proof trees
generated by the program in Fig. 1 are given in Fig. 2. Note that queries $y_1$, $y_2$, $y_3$ and
$y_4$ are unambiguous, being assigned a single proof tree, while $y_5$ is ambiguous.

A first useful distinction between the proof trees of Fig. 2 can be obtained by selecting
the two subtrees $t_1$: Z = a and $t_2$: Z = b as properties. This allows us to cluster the
proof trees into two disjoint sets on the basis of their similar statistical qualities. Since in
our training corpus seven out of ten queries come unambiguously with a proof tree including
property $t_1$, we would expect the maximum likelihood parameter value corresponding
to property $t_1$ to be higher than the parameter value of property $t_2$. However, we cannot
simply recreate the proportions of the training data from the corresponding proof trees
because we do not know the frequency of the possible proof trees of query $y_5$. A solution
to this incomplete-data problem is presented in the next section.



4 Inducing Log-Linear Models from Incomplete Data 

As shown in the foregoing example, statistical inference for log-linear models involves two 
problems: parameter estimation and property selection. In the following, we will present 
the details of an algorithm to solve these problems in the presence of incomplete data.



Statistical Inference and Probabilistic Modeling for Constraint-Based NLP 






4.1 Parameter Estimation 

The "improved iterative scaling" algorithm presented by || solves a maximum likelihood 
estimation problem for log-linear models with respect to complete data.⁴ This algorithm
itself is an extension of the "generalized iterative scaling" algorithm of [0] especially 
tailored to estimating models with large parameter spaces. We present a version of the 
first algorithm specifically designed for incomplete data problems. A proof of monotonicity 
and convergence of the algorithm is given in the appendix, i.e., we show that successive
steps of the algorithm increase the incomplete-data log-likelihood and eventually lead to 
convergence to a (local) maximum. 

Applying an incomplete-data framework to a log-linear probability model for CLP, we 
can assume the following to be given: 

• observed, incomplete data $y \in \mathcal{Y}$, corresponding to a given, finite sample of queries
for a constraint logic program $\mathcal{P}$,

• unobserved, complete data $x \in \mathcal{X}$, corresponding to the countably infinite sample
of proof trees for queries $\mathcal{Y}$ from $\mathcal{P}$,

• a many-to-one function $Y : \mathcal{X} \rightarrow \mathcal{Y}$ s.t. $Y(x) = y$ corresponds to the unique query
labeling proof tree $x$, and its inverse $X : \mathcal{Y} \rightarrow \mathcal{X}$ s.t. $X(y) = \{x \mid Y(x) = y\}$ is the
countably infinite set of proof trees for query $y$ from $\mathcal{P}$,

• a complete-data specification $p_\lambda(x)$, which is a log-linear distribution on $\mathcal{X}$ with
given initial distribution $p_0$, fixed property-functions vector $\nu$, and depending on
parameter vector $\lambda$,

• an incomplete-data specification $p_\lambda(y)$, which is related to the complete-data
specification by $p_\lambda(y) = \sum_{x \in X(y)} p_\lambda(x)$.

For the rest of this section we will refer to a given vector $\nu$ of property functions,
which is assumed to result from the property selection procedure presented below. If the
incomplete-data log-likelihood function $L$ is defined over a sample $\mathcal{Y}$ of tokens of queries
$y$ s.t. $L(\lambda) = \ln \prod_{y \in \mathcal{Y}} p_\lambda(y)$, then the problem of maximum-likelihood estimation for
log-linear models from incomplete data can be stated as follows. Given a fixed $\mathcal{Y}$-sample and
a set $\Lambda = \{\lambda \mid p_\lambda(x)$ is a log-linear distribution on $\mathcal{X}$ with fixed $p_0$, fixed $\nu$ and $\lambda \in \mathbb{R}^n\}$,
we want to find a maximum likelihood estimate $\lambda^*$ of $\lambda$ s.t. $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$.
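As a minimal sketch of the incomplete-data log-likelihood, the following Python code evaluates $L(\lambda)$ for the ten-query example of Sect. 3, with the subtrees Z = a and Z = b as the two properties and a uniform initial distribution over the six proof trees. The data layout (`TREES`, `CORPUS`) and the function names are assumptions made for illustration.

```python
import math

# Proof trees of the running example as (query, property counts (nu_1, nu_2)).
# Property 1 is the subtree Z = a, property 2 the subtree Z = b.
TREES = [
    ("y1", (1, 0)),
    ("y2", (0, 1)),
    ("y3", (1, 0)),
    ("y4", (0, 1)),
    ("y5", (1, 0)),
    ("y5", (0, 1)),
]
# Training corpus: number of tokens of each query.
CORPUS = {"y1": 3, "y2": 1, "y3": 4, "y4": 1, "y5": 1}


def complete_data_model(lam):
    """p_lambda(x) for every proof tree x, with a uniform initial distribution."""
    weights = [
        math.exp(sum(l * v for l, v in zip(lam, nu))) / len(TREES)
        for _, nu in TREES
    ]
    z = sum(weights)
    return [w / z for w in weights]


def incomplete_log_likelihood(lam):
    """L(lambda) = sum over query tokens of ln p_lambda(y),
    where p_lambda(y) sums p_lambda(x) over the proof trees x of y."""
    p = complete_data_model(lam)
    total = 0.0
    for query, count in CORPUS.items():
        p_y = sum(px for (q, _), px in zip(TREES, p) if q == query)
        total += count * math.log(p_y)
    return total


print(incomplete_log_likelihood((0.0, 0.0)))
```

The ambiguous query contributes the sum of the probabilities of both of its proof trees, which is what makes $L$ non-concave in general.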

4 For complete data, this is equivalent to solving a constrained maximum entropy problem.

The key idea of the proposed method is to iteratively maximize a strictly concave
auxiliary function when the non-concave log-likelihood function cannot be maximized
analytically. Following ||, we can define an auxiliary function $A$ directly as a lower bound
on $L(\gamma + \lambda) - L(\lambda)$, i.e., as a conservative estimate of the difference in log-likelihood when
going from a basic model $p_\lambda$ to an extended model $p_{\gamma+\lambda}$.⁵ The specific design of $A$ for
incomplete data can be derived from the complete data case, in essence, by replacing
an expectation of complete, but unobserved, data by a conditional expectation given the
observed data and the current parameter values.⁶ Let $\lambda \in \Lambda$, $\gamma \in \mathbb{R}^n$. Then

$A(\gamma, \lambda) = \sum_{y \in \mathcal{Y}} \big(1 + k_\lambda[\gamma \cdot \nu] - p_\lambda[\sum_{i=1}^{n} \bar{\nu}_i e^{\gamma_i \nu_\#}]\big)$.

$A$ is maximized in $\gamma$ at the unique point $\hat{\gamma}$ satisfying for each $\hat{\gamma}_i$:

$\sum_{y \in \mathcal{Y}} k_\lambda[\nu_i] = \sum_{y \in \mathcal{Y}} p_\lambda[\nu_i e^{\hat{\gamma}_i \nu_\#}]$.
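The condition on $\hat{\gamma}_i$ can be recovered by setting the partial derivatives of $A$ to zero. The following is a sketch of that step, reconstructed here from the formula for $A$ and the definitions of $k_\lambda$, $\bar{\nu}_i$ and $\nu_\#$ in footnote 6.

```latex
\begin{align*}
\frac{\partial}{\partial \gamma_i} A(\gamma, \lambda)
  &= \sum_{y \in \mathcal{Y}} \Big( k_\lambda[\nu_i]
     - p_\lambda\big[\bar{\nu}_i \, \nu_\# \, e^{\gamma_i \nu_\#}\big] \Big) \\
  &= \sum_{y \in \mathcal{Y}} \Big( k_\lambda[\nu_i]
     - p_\lambda\big[\nu_i \, e^{\gamma_i \nu_\#}\big] \Big)
     \qquad \text{since } \bar{\nu}_i \, \nu_\# = \nu_i, \\
\text{so at the maximum} \quad
\sum_{y \in \mathcal{Y}} k_\lambda[\nu_i]
  &= \sum_{y \in \mathcal{Y}} p_\lambda\big[\nu_i \, e^{\hat{\gamma}_i \nu_\#}\big]
     \qquad \text{for each } i = 1, \ldots, n.
\end{align*}
```

Each equation involves only the single unknown $\hat{\gamma}_i$, so the $n$ updates decouple and each can be found by a one-dimensional root search.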

An iterative algorithm for maximizing L is constructed from A as follows. For the want 
of a name, we will call this the "Iterative Maximization (IM)" algorithm. 

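Definition 2 (Iterative maximization). Let $\mathcal{M} : \Lambda \rightarrow \Lambda$ be a mapping defined by
$\mathcal{M}(\lambda) = \hat{\gamma} + \lambda$ with $\hat{\gamma} = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma, \lambda)$. Then each step of the IM algorithm is
defined by $\lambda^{(k+1)} = \mathcal{M}(\lambda^{(k)})$.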

As shown in the appendix, this procedure stepwise increases the log-likelihood function 
L and eventually converges to a (local) maximum of L. For large configuration spaces X 
the expectations to be calculated can get intractable. Here approximations by conditional 
models or Monte Carlo methods have to be used. 
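A single IM step can be sketched in Python for the case where the set of proof trees is small enough to enumerate. The sketch assumes that each $\hat{\gamma}_i$ solves the one-dimensional stationarity equation of $A$, namely that the summed conditional expectations of $\nu_i$ equal the summed model expectations of $\nu_i e^{\gamma_i \nu_\#}$, and it finds the root by Newton's method. The function names, the data layout, and the toy data are hypothetical, and every property is assumed to occur in at least one observed proof tree.

```python
import math


def model(lam, p0, nu):
    """Complete-data model p_lambda(x) on a finite list of proof trees."""
    w = [p * math.exp(sum(l * v for l, v in zip(lam, f))) for p, f in zip(p0, nu)]
    z = sum(w)
    return [wx / z for wx in w]


def log_likelihood(lam, p0, nu, groups, counts):
    """Incomplete-data log-likelihood; groups[j] lists the trees of query j."""
    p = model(lam, p0, nu)
    return sum(c * math.log(sum(p[x] for x in g)) for g, c in zip(groups, counts))


def im_step(lam, p0, nu, groups, counts, newton_iters=50):
    """One IM step lambda -> lambda + gamma_hat."""
    p = model(lam, p0, nu)
    n_tokens = sum(counts)
    nu_sharp = [sum(f) for f in nu]  # nu_#(x), assumed > 0 for every tree
    new_lam = []
    for i in range(len(lam)):
        # Sum over query tokens of the conditional expectation of nu_i given y.
        observed = sum(
            c * sum(p[x] * nu[x][i] for x in g) / sum(p[x] for x in g)
            for g, c in zip(groups, counts)
        )
        gamma = 0.0
        for _ in range(newton_iters):
            terms = [p[x] * nu[x][i] * math.exp(gamma * nu_sharp[x])
                     for x in range(len(p))]
            f = n_tokens * sum(terms) - observed
            df = n_tokens * sum(t * s for t, s in zip(terms, nu_sharp))
            gamma -= f / df
        new_lam.append(lam[i] + gamma)
    return new_lam


# Hypothetical toy data: four proof trees, two properties, nu_# not constant.
p0 = [0.25, 0.25, 0.25, 0.25]
nu = [(1, 0), (1, 1), (0, 2), (2, 1)]
groups = [[0, 1], [2], [3]]  # the first query is ambiguous
counts = [4, 1, 2]

lam = [0.0, 0.0]
for step in range(5):
    print(step, lam, log_likelihood(lam, p0, nu, groups, counts))
    lam = im_step(lam, p0, nu, groups, counts)
```

Exhaustive enumeration is only feasible for toy problems; for large configuration spaces the expectations inside `im_step` would have to be approximated, as noted above.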

4.2 Property Selection 

A further problem is that exhaustive sets of properties can get unmanageably large. Let 
properties of proof trees be defined as connected subgraphs of proof trees, and suppose 
that properties can incrementally be constructed by selecting from an initial set of goals 
and from subtrees built by performing a resolution step at a terminal node of a subtree 
already in the model. Clearly, the exponentially growing set of possible properties must 
be pruned by some quality measure. An appropriate measure can then be used to define 
an algorithm for automatic property selection. 

A straightforward measure to take would be the improvement in log-likelihood when
extending a model by a single candidate property $c$ with corresponding parameter $\alpha$. This
would require iterative maximization for each candidate property and is thus infeasible.
Following ||, we could instead approximate the improvement due to adding a single property
by adjusting only the parameter of this candidate and holding all other parameters
of the model fixed. Unfortunately, the incomplete-data log-likelihood $L$ is not concave in
the parameters and thus cannot be maximized directly. However, we can instantiate the
auxiliary function $A$ used in parameter estimation to the extension of a model $p_\lambda$ by a
single property $c$ with log-parameter $\alpha$, i.e., we can express an approximate gain $G_c(\alpha, \lambda)$
of adding a candidate property $c$ with log-parameter value $\alpha$ to a log-linear model $p_\lambda$ as
a conservative estimate of the true gain in log-likelihood as follows.

$G_c(\alpha, \lambda) = \sum_{y \in \mathcal{Y}} \big(1 + k_\lambda[\alpha c] - p_\lambda[e^{\alpha c}]\big)$.

$G_c(\alpha, \lambda)$ is maximized in $\alpha$ at the unique point $\hat{\alpha}$ satisfying

$\sum_{y \in \mathcal{Y}} k_\lambda[c] = \sum_{y \in \mathcal{Y}} p_\lambda[c \, e^{\hat{\alpha} c}]$.

5 Another possibility to arrive at the same auxiliary function is to use the complete-data auxiliary function of
H in the M-step of a generalized EM algorithm. This guarantees monotonicity of the resulting algorithm,
but convergence has yet to be proved. Our approach views the incomplete-data auxiliary function directly as
a lower bound on the improvement in incomplete-data log-likelihood, which enables an intuitive and elegant
proof of convergence.

6 $k_\lambda(x|y) = p_\lambda(x) / \sum_{x \in X(y)} p_\lambda(x)$ is the conditional probability of the complete data $x$ given the observed data
$y$ and the current fit of the parameter $\lambda$. Furthermore, $p[f] = \sum_{\omega \in \Omega} p(\omega) f(\omega)$ is the expectation of a function
$f : \Omega \rightarrow \mathbb{R}$ with respect to a probability distribution $p$ on a set $\Omega$, $\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x)$, and $\bar{\nu}_i(x) = \nu_i(x) / \nu_\#(x)$.

Property selection will incorporate that property out of the set of candidates that 
gives the greatest improvement to the model at the property's best adjusted parameter 
value. Since we are interested only in relative, not absolute gains, a single, non-iterative 
maximization of the approximate gain will be sufficient to choose from the candidates. 
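For a candidate property that occurs at most once per proof tree, the maximization of the approximate gain has a closed form, which the following Python sketch evaluates on the running example with all log-parameters at zero. The restriction to binary properties, the function name `approximate_gain`, and the data layout are assumptions made here for illustration; the general case needs a one-dimensional numerical search.

```python
import math


def approximate_gain(p, c, groups, counts):
    """Maximize G_c(alpha, lambda) for a binary property c.

    p:      current model p_lambda(x) over a finite list of proof trees
    c:      c[x] in {0, 1}, whether tree x contains the candidate property
    groups: groups[j] lists the proof trees of query j
    counts: counts[j] is the number of tokens of query j in the sample
    Returns (alpha_hat, gain). Assumes the property is observed at least once.
    """
    n_tokens = sum(counts)
    # K: sum over query tokens of the conditional expectation of c given y.
    k_sum = sum(
        n * sum(p[x] * c[x] for x in g) / sum(p[x] for x in g)
        for g, n in zip(groups, counts)
    )
    expected = n_tokens * sum(px * cx for px, cx in zip(p, c))  # N * p_lambda[c]
    alpha_hat = math.log(k_sum / expected)
    # For binary c: G_c(alpha) = alpha * K - N * p_lambda[c] * (exp(alpha) - 1).
    gain = alpha_hat * k_sum - (k_sum - expected)
    return alpha_hat, gain


# Running example: six proof trees with uniform probabilities;
# trees 4 and 5 both belong to the ambiguous query y5.
p = [1 / 6] * 6
groups = [[0], [1], [2], [3], [4, 5]]
counts = [3, 1, 4, 1, 1]
t1 = [1, 0, 1, 0, 1, 0]  # subtree Z = a
t2 = [0, 1, 0, 1, 0, 1]  # subtree Z = b
for name, c in (("t1", t1), ("t2", t2)):
    print(name, approximate_gain(p, c, groups, counts))
```

Because only the ranking of candidates matters, one such evaluation per candidate is enough; no iteration over the remaining parameters is involved.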

4.3 Combined Statistical Inference 

A combined algorithm for statistical inference for log-linear models from incomplete data 
is as follows. 

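Input Initial model $p_0$, multiple incomplete-data sample $\mathcal{Y}$.

Output Log-linear model $p^*$ on complete-data sample $\mathcal{X}$ with selected property function
vector $\nu^*$ and estimated log-parameter vector $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$ where $\Lambda =
\{\lambda \mid p_\lambda$ is a log-linear model on $\mathcal{X}$ based on $p_0$, $\nu^*$ and $\lambda \in \mathbb{R}^n\}$.

Procedure

1. $p^{(0)} := p_0$ with $C^{(0)} := \emptyset$,

2. Property selection: For each candidate property $c \in C^{(t)}$, compute the gain
$G_c(\lambda^{(t)}) := \max_{\alpha \in \mathbb{R}} G_c(\alpha, \lambda^{(t)})$, and select the property $\hat{c} := \arg\max_{c \in C^{(t)}} G_c(\lambda^{(t)})$.

3. Parameter estimation: Compute a maximum likelihood parameter value $\hat{\lambda} :=
\arg\max_{\lambda \in \Lambda} L(\lambda)$ where $\Lambda = \{\lambda \mid p_\lambda(x)$ is a log-linear distribution on $\mathcal{X}$ with
initial model $p_0$, property function vector $\hat{\nu} := \nu^{(t)} + \hat{c}$, and $\lambda \in \mathbb{R}^n\}$.

4. Until the model converges, set
$p^{(t+1)} := p_{\hat{\lambda} \cdot \hat{\nu}}$, $t := t + 1$, go to 2.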

Let us return to the example of Sect. 3 and apply the IM algorithm to the incomplete-data
problem stated there. For the selected properties $t_1$ and $t_2$, we have $\nu_\#(x) = \nu_1(x) +
\nu_2(x) = 1$ for all possible proof trees $x$ for the sample of Fig. 2. Thus the parameter
updates $\hat{\gamma}_i$ can be calculated from a particularly simple closed form as follows.

$\hat{\gamma}_i = \ln \frac{\sum_{y \in \mathcal{Y}} k_\lambda[\nu_i]}{\sum_{y \in \mathcal{Y}} p_\lambda[\nu_i]}$

A sequence of IM iterates up to stability in the third place after the decimal point of the
incomplete-data log-likelihood is given in Table 1. Probabilities of proof trees involving
property $t_i$ are denoted by $p_i$. Starting from an initial uniform probability of 1/6 for
each proof tree, this estimation sequence converges to the desired accuracy after three
iterations and yields probabilities $p_1 \approx .259$ and $p_2 \approx .074$ for the respective proof trees.



Iteration t   $\lambda_1^{(t)}$   $\lambda_2^{(t)}$   $p_1^{(t)}$   $p_2^{(t)}$   $L(\lambda^{(t)})$
0             0                   0                   1/6           1/6           -17.224448
1             ln 1.5              ln .5               .25           .083          -15.772486
2             ln 1.55             ln .45              .2583         .075          -15.753678
3             ln 1.555            ln .445             .25916        .07416        -15.753481



Table 1: Estimation using the IM algorithm 
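The iterates of Table 1 can be recomputed with a few lines of Python. The sketch below assumes the closed-form update that holds when every proof tree contains exactly one property, i.e., the update is the logarithm of the summed conditional expectations of a property divided by its summed model expectations. The encoding of the six proof trees and all names are assumptions made for illustration.

```python
import math

# Six proof trees as (query index, which property the tree contains).
# Property 0 is t1 (Z = a), property 1 is t2 (Z = b); nu_#(x) = 1 for every tree.
TREE_QUERY = [0, 1, 2, 3, 4, 4]
TREE_PROP = [0, 1, 0, 1, 0, 1]
COUNTS = [3, 1, 4, 1, 1]  # tokens of y1, ..., y5
N = sum(COUNTS)


def model(lam):
    """p_lambda(x) with uniform initial distribution p_0(x) = 1/6."""
    w = [math.exp(lam[prop]) / 6 for prop in TREE_PROP]
    z = sum(w)
    return [wx / z for wx in w]


def query_probs(p):
    """p_lambda(y): sum of p_lambda(x) over the proof trees of each query."""
    q = [0.0] * len(COUNTS)
    for x, y in enumerate(TREE_QUERY):
        q[y] += p[x]
    return q


def log_likelihood(lam):
    q = query_probs(model(lam))
    return sum(c * math.log(qy) for c, qy in zip(COUNTS, q))


def im_step(lam):
    """Closed-form IM update, valid here because nu_# = 1 for all trees."""
    p = model(lam)
    q = query_probs(p)
    new_lam = []
    for i in range(2):
        observed = sum(COUNTS[y] * p[x] / q[y]
                       for x, y in enumerate(TREE_QUERY) if TREE_PROP[x] == i)
        expected = N * sum(p[x] for x in range(6) if TREE_PROP[x] == i)
        new_lam.append(lam[i] + math.log(observed / expected))
    return new_lam


lam = [0.0, 0.0]
for t in range(4):
    p = model(lam)
    # Columns: t, exp(lambda_1), exp(lambda_2), p_1, p_2, L(lambda)
    print(t, [round(math.exp(l), 4) for l in lam],
          round(p[0], 5), round(p[1], 5), round(log_likelihood(lam), 6))
    lam = im_step(lam)
```

The printed rows agree with Table 1 up to rounding; the script reports $e^{\lambda_i}$ rather than $\lambda_i$ so that the entries can be compared directly with the "ln" values in the table.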



5 Conclusion 

We have presented a probabilistic model for CLP, coupled with an algorithm to induce 
the parameters and properties of log-linear models from incomplete data. This algorithm 
is applicable to log-linear probability distributions in general, and has been shown here 
to be useful to estimate the parameters of probabilistic context-sensitive NLP models. In 
contrast to related approaches such as that of [0 or Jl5|], our statistical inference algorithm 
provides the means for automatic and reusable training of probabilistic constraint-based 
grammars from unparsed corpora. 

Furthermore, heuristic search algorithms for finding the most probable analysis in the
CLP model can be based upon this probability model. For example, a combination of
the dynamic-programming techniques of Earley deduction [19] and Viterbi-searching [20]
could be employed. Depending on the class of constraint-based grammars under consideration,
a considerable gain in search efficiency can be obtained.⁷

The statistical inference algorithm presented is fully implemented and has already 
been tested empirically with simple examples. Clearly, the performance of the presented 
techniques in real-world NLP problems has to be thoroughly investigated. Unfortunately,
the limited availability of broad-coverage constraint-based grammars has so far restricted the
empirical evaluation of the presented techniques in the area of constraint-based parsing.
In future work we will also investigate the applicability of the statistical methods described
here to NLP problems other than constraint-based parsing.



7 However, note that the choice of a particular class of constraint-based grammars also influences the behaviour 
of the algorithm in finding the optimal analysis. For example, in grammars where variable-bindings are ignored 
in Viterbi-searching in order to avoid the overhead of storing each variable binding separately the problem of 
pursuing a non-optimal path arises. In such cases only approximate heuristic searching can be done (see Q for 
a similar approach to a Viterbi-like heuristic search procedure for unification-based grammars). 
A Iterative Maximization: Propositions and Proofs

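In the following, we assume that for each property function $\nu_i$ some proof tree $x \in \mathcal{X}$ with $\nu_i(x) > 0$ exists, and
require $p_\lambda$ to be strictly positive on $\mathcal{X}$, i.e., $p_\lambda(x) > 0$ for all $x \in \mathcal{X}$. Furthermore, $p_{\gamma+\lambda}(x) = Z_{\gamma \circ \lambda}^{-1} e^{\gamma \cdot \nu(x)} p_\lambda(x)$
denotes an extended log-linear model with $Z_{\gamma \circ \lambda} = p_\lambda[e^{\gamma \cdot \nu}]$.

Lemma 1 shows that the auxiliary function $A(\gamma, \lambda)$ is a lower bound on the incomplete-data log-likelihood
difference $L(\gamma + \lambda) - L(\lambda)$.

Lemma 1. $A(\gamma, \lambda) \leq L(\gamma + \lambda) - L(\lambda)$.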

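Proof.

\begin{align*}
L(\gamma + \lambda) - L(\lambda)
 &= \sum_{y \in \mathcal{Y}} \ln \frac{p_{\gamma+\lambda}(y)}{p_\lambda(y)} \\
 &= \sum_{y \in \mathcal{Y}} \ln \sum_{x \in X(y)} \frac{p_{\gamma+\lambda}(x)}{p_\lambda(y)} \\
 &= \sum_{y \in \mathcal{Y}} \ln \sum_{x \in X(y)} \frac{p_\lambda(x)}{p_\lambda(y)} \, \frac{p_{\gamma+\lambda}(x)}{p_\lambda(x)} \\
 &\geq \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} \frac{p_\lambda(x)}{p_\lambda(y)} \ln \frac{p_{\gamma+\lambda}(x)}{p_\lambda(x)}
   && \text{by Jensen's inequality} \\
 &= \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} \frac{p_\lambda(x)}{p_\lambda(y)} \big( \ln p_{\gamma+\lambda}(x) - \ln p_\lambda(x) \big) \\
 &= \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} \frac{p_\lambda(x)}{p_\lambda(y)} \big( \ln Z_{\gamma \circ \lambda}^{-1} + \ln e^{\gamma \cdot \nu(x)} + \ln p_\lambda(x) - \ln p_\lambda(x) \big) \\
 &= \sum_{y \in \mathcal{Y}} \big( k_\lambda[\gamma \cdot \nu] - \ln p_\lambda[e^{\gamma \cdot \nu}] \big) \\
 &\geq \sum_{y \in \mathcal{Y}} \big( k_\lambda[\gamma \cdot \nu] + 1 - p_\lambda[e^{\gamma \cdot \nu}] \big)
   && \text{since } \ln x \leq x - 1 \\
 &= \sum_{y \in \mathcal{Y}} \Big( k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x) \, e^{\sum_{i=1}^{n} \gamma_i \nu_i(x)} \Big) \\
 &= \sum_{y \in \mathcal{Y}} \Big( k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x) \, e^{\sum_{i=1}^{n} \gamma_i \bar{\nu}_i(x) \nu_\#(x)} \Big) \\
 &\geq \sum_{y \in \mathcal{Y}} \Big( k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x) \sum_{i=1}^{n} \bar{\nu}_i(x) \, e^{\gamma_i \nu_\#(x)} \Big)
   && \text{by Jensen's inequality} \\
 &= \sum_{y \in \mathcal{Y}} \Big( k_\lambda[\gamma \cdot \nu] + 1 - p_\lambda\big[\sum_{i=1}^{n} \bar{\nu}_i \, e^{\gamma_i \nu_\#}\big] \Big) \\
 &= A(\gamma, \lambda). \qquad \Box
\end{align*}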

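Lemma 2 shows that there is no estimated improvement in log-likelihood at the origin, and Lemma 3 shows
that the critical points of interest are the same for $A$ and $L$.

Lemma 2. $A(0, \lambda) = 0$.

Lemma 3. $\frac{d}{dt}\big|_{t=0} A(t\gamma, \lambda) = \frac{d}{dt}\big|_{t=0} L(t\gamma + \lambda)$.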



Theorem 4 shows the monotonicity of the IM algorithm.

Theorem 4. For all $\lambda \in \Lambda$: $L(\mathcal{M}(\lambda)) \geq L(\lambda)$ with equality iff $\lambda$ is a fixed point of $\mathcal{M}$ or equivalently is a critical
point of $L$.
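Proof.

\begin{align*}
L(\mathcal{M}(\lambda)) - L(\lambda) &\geq A(\hat{\gamma}, \lambda) && \text{by Lemma 1} \\
 &\geq 0 && \text{by Lemma 2 and definition of } \mathcal{M}.
\end{align*}

The equality $L(\mathcal{M}(\lambda)) = L(\lambda)$ holds iff $\lambda$ is a fixed point of $\mathcal{M}$, i.e., $\mathcal{M}(\lambda) = \hat{\gamma} + \lambda$ with $\hat{\gamma} = 0$. Furthermore, $\lambda$ is
a fixed point of $\mathcal{M}$ iff

\begin{align*}
 & \hat{\gamma} = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma, \lambda) = 0 \\
\Leftrightarrow\ & \text{for all } \gamma \in \mathbb{R}^n: \ \hat{t} = \arg\max_{t \in \mathbb{R}} A(t\gamma, \lambda) = 0 \\
\Leftrightarrow\ & \text{for all } \gamma \in \mathbb{R}^n: \ \tfrac{d}{dt}\big|_{t=0} A(t\gamma, \lambda) = 0 \\
\Leftrightarrow\ & \text{for all } \gamma \in \mathbb{R}^n: \ \tfrac{d}{dt}\big|_{t=0} L(t\gamma + \lambda) = 0 && \text{by Lemma 3} \\
\Leftrightarrow\ & \lambda \text{ is a critical point of } L. \qquad \Box
\end{align*}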



Corollary 5 implies that a maximum likelihood estimate is a fixed point of the mapping $\mathcal{M}$.

Corollary 5. Let $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$. Then $\lambda^*$ is a fixed point of $\mathcal{M}$.

Theorem 6 discusses the convergence properties of the IM algorithm. In contrast to the improved iterative
scaling algorithm, we cannot show convergence to a global maximum of a strictly concave objective function.
Rather, similar to the EM algorithm, we can show convergence of a sequence of IM iterates to a critical point of
the non-concave incomplete-data log-likelihood function $L$.

Theorem 6 (Convergence). Let $\{\lambda^{(k)}\}$ be a sequence in $\Lambda$ determined by the IM algorithm. Then all limit
points of $\{\lambda^{(k)}\}$ are fixed points of $\mathcal{M}$ or equivalently are critical points of $L$.

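Proof. Let $\{\lambda^{(k_n)}\}$ be a subsequence of $\{\lambda^{(k)}\}$ converging to $\bar{\lambda}$. Then for all $\gamma \in \mathbb{R}^n$:

\begin{align*}
A(\gamma, \lambda^{(k_n)}) &\leq A(\hat{\gamma}^{(k_n)}, \lambda^{(k_n)}) && \text{by definition of } \mathcal{M} \\
 &\leq L(\hat{\gamma}^{(k_n)} + \lambda^{(k_n)}) - L(\lambda^{(k_n)}) && \text{by Lemma 1} \\
 &= L(\lambda^{(k_n + 1)}) - L(\lambda^{(k_n)}) && \text{by definition of IM} \\
 &\leq L(\lambda^{(k_{n+1})}) - L(\lambda^{(k_n)}) && \text{by monotonicity of } L(\lambda^{(k)}),
\end{align*}

and in the limit as $n \rightarrow \infty$, for continuous $A$ and $L$: $A(\gamma, \bar{\lambda}) \leq L(\bar{\lambda}) - L(\bar{\lambda}) = 0$. Thus $\gamma = 0$ is a maximum of
$A(\gamma, \bar{\lambda})$, using Lemma 2, and $\bar{\lambda}$ is a fixed point of $\mathcal{M}$. Furthermore, $\frac{d}{dt}\big|_{t=0} A(t\gamma, \bar{\lambda}) = \frac{d}{dt}\big|_{t=0} L(t\gamma + \bar{\lambda}) = 0$, using
Lemma 3, and $\bar{\lambda}$ is a critical point of $L$. □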

From this and Theorem 4 it follows immediately that each sequence of likelihood values, for which an upper
bound exists, converges monotonically to a critical point of $L$.

Corollary 7. Let $\{L(\lambda^{(k)})\}$ be a sequence of likelihood values bounded from above. Then $\{L(\lambda^{(k)})\}$ converges
monotonically to a value $L^* = L(\lambda^*)$ for some critical point $\lambda^*$ of $L$.